#### Chapter 4

The Processor



#### Introduction



- CPU performance factors
- CPU Time=Instruction Count
  ×CPI ×Clock Cycle Time

- Instruction count
  - Determined by ISA and compiler
- CPI and Cycle time
  - Determined by CPU hardware
- We will examine two MIPS implementations
  - A simplified version (Single-cycle implementation)
  - A more realistic pipelined version
  - Multi-cycle version is removed in this version)
- Implement simple subset, but shows most aspects
  - Memory reference: I w, sw
  - Arithmetic/logical: add, sub, and, or, sl t
  - Control transfer: beq, j



# Review: MIPS Instruction Set Architecture (ISA)

- Instruction Categories
  - Arithmetic
  - Load/Store
  - Jump and Branch
  - Floating Point
    - coprocessor
  - Memory Management
  - Special

3 Instruction Formats: all 32 bits wide

| ОР | rs | rt          | rd   | sa         | funct | R format |  |  |  |
|----|----|-------------|------|------------|-------|----------|--|--|--|
| ОР | rs | rt          | imme | ] I format |       |          |  |  |  |
| ОР |    | jump target |      |            |       |          |  |  |  |





# **Review: MIPS Register Convention**

| Name        | Register<br>Number | Usage                  | Preserve on call? |
|-------------|--------------------|------------------------|-------------------|
| \$zero      | 0                  | constant 0 (hardware)  | n.a.              |
| \$at        | 1                  | reserved for assembler | n.a.              |
| \$v0 - \$v1 | 2-3                | returned values        | no                |
| \$a0 - \$a3 | 4-7                | arguments              | yes               |
| \$t0 - \$t7 | 8-15               | temporaries            | no                |
| \$s0 - \$s7 | 16-23              | saved values           | yes               |
| \$t8 - \$t9 | 24-25              | temporaries            | no                |
| \$gp        | 28                 | global pointer         | yes               |
| \$sp        | 29                 | stack pointer          | yes               |
| \$fp        | 30                 | frame pointer          | yes               |
| \$ra        | 31                 | return addr (hardware) | yes               |

# 成功方學

#### Instruction Execution

- PC (Program counter) is used to fetch instruction in the instruction memory)
- After instruction is obtained, register numbers in instructions is used to read registers in register files.
- PC ← PC +4 for sequentially execution





#### Different actions for different instruction classes



- Use ALU to calculate
  - Arithmetic result
  - Memory address for load/store
  - Branch target address
- Access data memory for load/store
- PC ← target address

```
add $t0, $s1, $s2

Iw $s1, 20($s2)

bne $t0, $s5, Exit

Iw $s1, 20($s2)

j Loop
```





## Multiplexers





#### **CPU Overview**



Details of each Mux and Control will be introduced later



# §4.2 Logic Design Conventions

0

#### Logic Design Basics

- Information encoded in binary
  - Low voltage = 0, High voltage = 1
  - One wire per bit
  - Multi-bit data encoded on multi-wire buses
- Combinational element (See next slide)
  - Operate on data
  - Output is a function of input
- State (sequential) elements
  - Output is a function of input and current states
  - Store information



# **Review: Combinational Elements**



AND-gate

$$-Y = A \& B$$

- Multiplexer
  - Y = S? I1: I0



Adder

$$Y = A + B$$



Arithmetic/Logic Unit







#### Review: Sequential Elements

- Register: stores data in a circuit
  - Uses a clock signal to determine when to update the stored value
  - Edge-triggered: update when Clk changes (0-> 1 or 1-> 0)
  - The following figure is positive edge-triggered: update when Clk changes from 0 to 1







# Review: Sequential Elements (with write enable)

- Register with write control
  - Only updates on clock edge when write control input is 1
  - Used when stored value is required later





# **Building a Datapath**

- Datapath: Elements that process data and addresses in the CPU
  - Registers, ALUs, mux's, memories, ...



We will show how to build MIPS datapath



#### **Instruction Fetch**





# 成功方學

b. ALU

#### **R-Format Instructions**

a. Registers

- Read two register operands
- Perform

   arithmetic/logical
   operation



 Write results into destination registers

add \$t0, \$s1, \$s2



# Load/Store Instructions (need 4 components)

- Read register operands =>register files
- Calculate address using 16-bit offset
  - Use ALU, but sign-extend offset
- Load/store: read memory and update register, and write register value to memory
  - Need data memory

lw \$t0, 4(\$s3) #load word from memory sw \$t0, 8(\$s3) #store word to memory











#### Datapath: Load/Store Instruction

#### Load/store





#### Animating the Datapath- load

- Load
  - e.g. lw \$t0, 4(\$s3)

- RN1: register number 1
- RN2: register number 2
- WN: register number that will be written
- WD: write data





#### Animating the Datapath- store

#### store

sw \$t0, 8(\$s3)



# 成功方學

#### Review: Specifying Branch Destinations

MIPS conditional branch instructions:

| ор     | rs     | rt     | offset       |
|--------|--------|--------|--------------|
| 6 bits | 5 bits | 5 bits | 16 bits 2000 |

- PC-relative addressing
  - Target address = PC + offset × 4
  - PC already incremented by 4 by this time from the low order 16 bits of the branch instruction

2000 beq \$s0 \$t1 2 2004 .... 2008 ... 200C Target Address (address of next instruction) =?

200C





#### Datapath: Branch Instructions

- Read register operands
- Compare operands
  - Use ALU, subtract and check Zero output
- Calculate target address
  - Sign-extend offset
  - Shift left 2 bits (word displacement)
  - Add to PC + 4 (already calculated by instruction fetch)



See animation in the next slide





#### Animating the Datapath (beq)

#### beq

e.g. beq \$s0 \$t1 2





# Sign-extension and shift left by 2 hardware known

Simple hardware is used for sign extension and shift





## Composing the Elements

- Make Data path do an instruction in one clock cycle
  - Each datapath element can only do one function at a time
  - Hence, we need separate instruction and data memories
- Use multiplexers where alternate data sources are used for different instructions



# R-Type/Load/Store Datapath

#### A Single Cycle Datapath



Correct Control signal (RegWrite, ALUSrc, ALU operation, MemWrite, MemtoReg, MemRead) are needed to make sure correct operation is done





# Full Datapath (Single Cycle Datapath)



# Control for the single-cycle CPU



- Designing a processor
- A single-cycle implementation (the datapath)
- Control for the single-cycle CPU
  - Control of CPU operations
  - ALU controller
  - Main controller





## Next: Building Datapath With Control





#### Main Control and ALU Control

- Main Control: Based on opcode: generate RegDst, Branch,
   MemRead MemtoReg, ALUOp MemWrite, ALUSrc, RegWrite
- ALU Control: Based on 2-bit ALUop and the 6-bit func field of instruction, the ALU control unit generates the 3-bit ALU control field







### **Deciding ALU Control**

- Assume 2-bit ALUOp derived from opcode
  - Combinational logic derives ALU control

| opcode | ALUOp | Operation        | funct  | ALU function     | ALU control |
|--------|-------|------------------|--------|------------------|-------------|
| lw     | 00    | load word        | XXXXXX | add              | ?           |
| SW     | 00    | store word       | XXXXXX | add              | ?           |
| beq    | 01    | branch equal     | XXXXXX | subtract         | ?           |
| R-type | 10    | add              | 100000 | add              | ?           |
|        |       | subtract         | 100010 | subtract         | ?           |
|        |       | AND              | 100100 | AND              | ?           |
|        |       | OR               | 100101 | OR               | ?           |
|        |       | set-on-less-than | 101010 | set-on-less-than | ?           |



#### **ALU Control**



 ALU Control has 4 four bits: Ainvert, Binvert, and Operation (2 bits)

| Function             | ALU control |
|----------------------|-------------|
| AND                  | 0000        |
| OR                   | 0001        |
| add                  | 0010        |
| subtract             | 0110        |
| set-on-less-<br>than | 0111        |
| NOR                  | 1100        |



If NOR is not supported, we just need three bits (MSB bit is always 0

#### **ALU Control**



#### ALU used for

– Load/Store: function = add

– Branch: function = subtract

R-type: function depends on funct field

#### Assume 2-bit ALUOp derived from opcode

Combinational logic derives ALU control

| opcode | ALUOp  | Operation        | funct  | ALU function     | ALU control |
|--------|--------|------------------|--------|------------------|-------------|
| lw     | 00     | load word        | XXXXXX | add              | 010         |
| sw     | 00     | store word       | XXXXXX | add              | 010         |
| beq    | 01     | branch equal     | XXXXXX | subtract         | 110         |
| R-type | 10 add |                  | 100000 | add              | 010         |
|        |        | subtract         | 100010 | subtract         | 110         |
|        |        | AND              | 100100 | AND              | 000         |
|        |        | OR               | 100101 | OR               | 001         |
|        |        | set-on-less-than | 101010 | set-on-less-than | 111         |



#### **Deciding Main Control Signals**

 Control I signal Instruction<31:0> Inst. <16:20> <31:26> Memory <u>Addr</u> **Funct Imm16** Rd Op Hopefully, **Control** generate **PCsrc** RegDst **ALUSrc** MemWr **MemtoReg Equal** RegWr MemRd **ALUctr** (ALUOp) Datapath





#### Review: The Main Control Unit

#### Control signals derived from instruction







### Control Signals for R-Type Instruction







# Control Signals: 1w Instruction







# Control Signals: sw Instruction







# Control Signals: beq Instruction







#### Review: Target address of Jump

• Assume  $PC=40000000_{16}$ , what is the target address of the jump instruction?

000010 00 0000000 00000010 00000001 6 bits 26 bits

Address in the instruction = 0x0000201

Target Address=  $PC[31:28]+0021_{16}*4= 0x40000804$ 





#### Review: Implementing Jumps

| Jump | 2     | address |
|------|-------|---------|
|      | 31:26 | 25:0    |

- Jump uses word address
- Update PC with concatenation of
  - Top 4 bits of old PC
  - 26-bit jump address
  - -00
- Need an extra control signal decoded from opcode







# Datapath Executing j

32 CONCAT ADD PC+4[31-28] ADD ALUOp 2 **ALU Control** Unit Control op 6 1[31:26] funct \$61[5:0] RD **ADDR** 32 Instruction I Instruction - RegDst Memory Operation 16 Branch RN1 RN2 WN op I[31: RD1 Zero Register File ALU **MemWrite MemtoReg** RD2 **ADDR** RegWrite Data Memory RD **→** WD **MemRead** 



# 成功方學

#### Truth Table for Main Control Signals

- Current design of control is for
  - Iw, sw, beq, and, or, add, sub, slt

See appendix D for details

- I-format: lw, sw, beq
- R-format: and, or, add, sub, slt
- Given 4 OP codes (each 6 bits) as "inputs", the "outputs" are as follows
   => a main control logic (the next slide)

| inputs      | σατρατό |        |        |       |      |       |        |        |        |  |
|-------------|---------|--------|--------|-------|------|-------|--------|--------|--------|--|
|             |         |        | Memto- | Reg   | Mem  | Mem   |        |        |        |  |
| Instruction | RegDst  | ALUSrc | Reg    | Write | Read | Write | Branch | ALUOp1 | ALUOp0 |  |
| R-format    | 1       | 0      | 0      | 1     | 0    | 0     | 0      | 1      | 0      |  |
|             | -       | O      | J      | _     | U    |       | O      | -      | J      |  |
| lw          |         |        |        |       |      |       |        |        |        |  |
| 100011      | 0       | 1      | 1      | 1     | 1    | 0     | 0      | 0      | 0      |  |
| SW          |         |        |        |       |      |       |        |        |        |  |
| 101011      | X       | 1      | Χ      | 0     | 0    | 1     | 0      | 0      | 0      |  |
| beq         |         |        |        |       |      |       |        |        |        |  |
| 000100      | X       | 0      | Χ      | 0     | 0    | 0     | 1      | 0      | 1      |  |

outputs

# Implementation of Main Control Block (Use PLA)



|                | Signal   | R-     | lw | SW | beq |
|----------------|----------|--------|----|----|-----|
|                | name     | format |    |    |     |
|                | Op5      | 0      | 1  | 1  | 0   |
| S              | Op4      | 0      | 0  | 0  | 0   |
| Ĭ              | Op3      | 0      | 0  | 1  | 0   |
| Inputs         | Op2      | 0      | 0  | 0  | 1   |
| H              | Op1      | 0      | 1  | 1  | 0   |
|                | Op0      | 0      | 1  | 1  | 0   |
|                | RegDst   | 1      | 0  | X  | X   |
|                | ALUSrc   | 0      | 1  | 1  | 0   |
|                | MemtoReg | r 0    | 1  | X  | X   |
| ts             | RegWrite | e 1    | 1  | 0  | 0   |
| <b>Outputs</b> | MemRead  | 0      | 1  | 0  | 0   |
| Ħ,             | MemWrite | e 0    | 0  | 1  | 0   |
| Ō              | Branch   | 0      | 0  | 0  | 1   |
|                | ALUOp1   | 1      | 0  | 0  | 0   |
|                | ALUOP0   | 0      | 0  | 0  | 1   |
|                | •        |        |    |    |     |



#### Main control PLA (programmable logic array)

$$RegDst = \overline{Op5} \cdot \overline{Op4} \cdot \overline{Op3} \cdot \overline{Op2} \cdot \overline{Op1} \cdot \overline{Op0}$$
ALUSrc=?

Truth table for main control signals



| Instruction opcode | ALUOp | Instruction operation | Funct field | Desired<br>ALU action | ALU control input |
|--------------------|-------|-----------------------|-------------|-----------------------|-------------------|
| LW                 | 00    | load word             | XXXXXX      | add                   | 0010              |
| SW                 | 00    | store word            | XXXXXX      | add                   | 0010              |
| Branch equal       | 01    | branch equal          | XXXXXX      | subtract              | 0110              |
| R-type             | 10    | add                   | 100000      | add                   | 0010              |
| R-type             | 10    | subtract              | 100010      | subtract              | 0110              |
| R-type             | 10    | AND                   | 100100      | AND                   | 0000              |
| R-type             | 10    | OR                    | 100101      | OR                    | 0001              |
| R-type             | 10    | set on less than      | 101010      | set on less than      | 0111              |

#### Truth Table for ALU control signals

inputs outputs

Merge LW & SW

|   | ALUOp  |        |    | Funct field |    |    |    |    | Operation          |        |
|---|--------|--------|----|-------------|----|----|----|----|--------------------|--------|
|   | ALUOp1 | ALUOp0 | F5 | F4          | F3 | F2 | F1 | F0 |                    |        |
| , | 0      | 0      | X  | Χ           | Χ  | Χ  | Χ  | Χ  |                    | ld     |
|   | 0      | 1      | Χ  | Χ           | Χ  | Χ  | Χ  | Χ  | 0110 <sup>SL</sup> | btract |
|   | 1      | X      | Χ  | Χ           | 0  | 0  | 0  | 0  | 0010 ad            | ld     |
|   | 1      | X      | X  | Χ           | 0  | 0  | 1  | 0  | 0110               | btract |
|   | 1      | X      | X  | Χ           | 0  | 1  | 0  | 0  | 0000 ar            | ıd     |
|   | 1      | X      | X  | Χ           | 0  | 1  | 0  | 1  | 0001               | r      |
|   | 1      | Χ      | X  | Χ           | 1  | 0  | 1  | 0  | 0111 S             | t      |



# 成功方學

## Implementation of ALU Control Block



C2

C0

| ALI    | JOp    |    |    | Function o | ode fields | (  |    |
|--------|--------|----|----|------------|------------|----|----|
| ALUOp1 | ALUOp0 | F5 | F4 | F3         | F2         | F1 | FO |
| X      | 1      | X  | Х  | Х          | X          | Х  | X  |
| 1      | Х      | X  | Х  | X          | X          | 1  | X  |

ALUOp0=1 is able to C1 identify branch instruction

| ALUOp  |        | Function code fields |    |    |    |    |    |  |  |
|--------|--------|----------------------|----|----|----|----|----|--|--|
| ALUOp1 | ALUOp0 | F5                   | F4 | F3 | F2 | F1 | F0 |  |  |
| 0      | X      | Χ                    | X  | X  | Х  | X  | X  |  |  |
| X      | X      | X                    | X  | X  | 0  | X  | X  |  |  |

| ALUOp  |        | Function code fields |    |    |    |    |    |  |  |
|--------|--------|----------------------|----|----|----|----|----|--|--|
| ALUOp1 | ALUOp0 | F5                   | F4 | F3 | F2 | F1 | FO |  |  |
| 1      | X      | Х                    | X  | X  | Х  | X  | 1  |  |  |
| 1      | X      | X                    | X  | 1  | X  | X  | X  |  |  |



C2 = ALUOp0 or (ALUOp1 and F1)

 $C1 = \overline{F2} \text{ or } \overline{ALUOP1}$ 

C0 = ALUOP1 and (F0 or F3)



#### Performance Issues



- Longest delay determines clock period
  - Critical path: load instruction
  - Instruction memory → register file → ALU → data memory → register file
- Not feasible to vary period for different instructions
- Violates design principle
  - Making the common case fast
- We will improve performance by pipelining





# **Backup Slides**

